Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add total.label argument for groupingsets, cube, rollup #5973

Merged
merged 17 commits into from
Sep 23, 2024

Conversation

markseeto
Copy link
Contributor

Closes #5351

Added a total.label argument to the rollup.data.table, cube.data.table, and groupingsets.data.table functions.

A couple of comments:

  • It was sometimes not clear what should be an error, what should be a warning, and what should be neither. I made choices that seem reasonable to me, but there would be alternatives that are also reasonable.
  • I wondered whether all.label might be a better name than total.label, since the thing being calculated won't always be a total. But I left it as total.label because the documentation uses the terminology "(sub-)totals", and the word "total" is used in other software, so "total" seems to be widely accepted.

@MichaelChirico MichaelChirico removed their request for review March 3, 2024 19:55
Copy link

codecov bot commented Mar 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.52%. Comparing base (b6d6100) to head (b6d1cf9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5973      +/-   ##
==========================================
+ Coverage   97.51%   97.52%   +0.01%     
==========================================
  Files          80       80              
  Lines       14915    14987      +72     
==========================================
+ Hits        14544    14616      +72     
  Misses        371      371              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jangorecki
Copy link
Member

As for the arg name I would keep it as "label"

R/groupingsets.R Outdated
@@ -57,6 +57,16 @@ groupingsets.data.table = function(x, j, by, sets, .SDcols, id = FALSE, jj, ...)
stopf("Argument 'sets' must be a list of character vectors.")
if (!is.logical(id))
stopf("Argument 'id' must be a logical scalar.")
if (!(is.null(total.label) ||
(is.character(total.label) && length(total.label) == 1L) ||
(is.list(total.label) && all(vapply_1b(total.label, is.character)) &&
Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rule out list because AFAIU char vector will be sufficient. Please correct me if I am wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki Thanks for reviewing. To make sure I understand you correctly, do you mean that total.label = list(Region = "National", Product = "Total") should not be allowed, and it should instead be total.label = c(Region = "National", Product = "Total")? I would have to try it to be sure, but I think that would be OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as long as each element in the list is scalar then list is not necessary and character vector is sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki Thanks, I'll try that.

Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was wrong and labels must not necessarily be of character type, is that correct? Then we need a list type so mixed types can be provided.

Would be good to use such little bit more complex example so it comes up straight away.

Copy link
Contributor Author

@markseeto markseeto Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jangorecki. Currently it only supports character and factor grouping variables. For other types, it's left as NA.

In #5351 you wrote

Then we of course want total.label="Total" to be recycled to list(State = "Total", Group = "Total"). Of course only if all columns are character.

I interpreted this as meaning the label argument should only apply to variables of type character, but now I realise that maybe I interpreted incorrectly and you might have meant that total.label = "Total" should only be recycled to character variables, but non-character labels for non-character variables are still allowed in a list.

A couple of options (and there might be others):

  1. Only apply the label argument to grouping variables of type character or factor, and require label = c(Region = "National", Product = "Total") instead of label = list(Region = "National", Product = "Total").
  2. Allow the label argument to apply to grouping variables that aren't character or factor, and allow total.label to be a list so that different types can be specified for different variables.

Which do you prefer? As I mentioned above, if I had a date or integer grouping variable, I would probably still want a character total label (e.g. "Total") rather than a date or integer total label. To do this, currently the user would have to convert the variable to character or factor before using groupingsets/cube/rollup.

Maybe for non-character grouping variables, we could allow a choice of specifying a label of the same type as the variable, or a character label, and if a character label is provided then the variable will be converted to character in the output. For example, if A is character and B is integer, then label = list(A = "LabelA", B = 999L) and label(A = "LabelA", B = "LabelB") would both be allowed, and the second would result in B being character in the output. If we do that, then I think I would favour having label = "Label" apply to all variables, not just character variables.

Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing column type is not an option, column type needs to be as original columns used in 'by' and not in labels (so no for option B). Option 2. We cannot apply it to all variables because we must match column types, we don't just print results but return a data.table so saying it is only for output won't be useful. Our output is data.table and column types cannot depend on labels arg. It is the labels arg has to match to column types in the output 'by'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "output", I meant the data.table that is returned. Sorry for the confusion.

Thanks for clarifying that the column types have to stay the same. That makes it simpler.

Do you think the shorter form label = "Total" should be allowed, or should label always be a named list? If the shorter form is allowed, should it always have to be character, or would something like label = 999L be allowed and apply to all the integer grouping variables?

Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be allowed but for non-char column will result an NA, which anyway is likely to be expected. Similarly 999L can be also allowed but would result NA for non int columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll give that a try.

@markseeto
Copy link
Contributor Author

As for the arg name I would keep it as "label"

@jangorecki Thanks for your comment. Do you mean change it from "total.label" to "label", or keep it as "total.label"?

@jangorecki
Copy link
Member

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki
Copy link
Member

Very good PR. What would be even more useful is minimal example usage in NEWS entry. Including different data types for labels.

@markseeto
Copy link
Contributor Author

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

@markseeto
Copy link
Contributor Author

Very good PR. What would be even more useful is minimal example usage in NEWS entry. Including different data types for labels.

@jangorecki Thanks for the feedback. I can add an example to the NEWS entry.

@jangorecki
Copy link
Member

jangorecki commented Mar 4, 2024

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

Another reason why I prefer single word argument name is it does not enforce any style, like total.label or totalLabel or total_label will do.

@markseeto
Copy link
Contributor Author

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

Another reason why I prefer single word argument name is it does not enforce any style, like total.label or totalLabel or total_label will do.

OK, I can change it. Maybe "label" is better than "labels", since "labels" is a base function.

@markseeto
Copy link
Contributor Author

markseeto commented Apr 4, 2024

@jangorecki I've updated the functions, tests, documentation, and news, based on our discussions. When you have time, please let me know what you think.

I wasn't sure about the correct way to update a PR. Apologies if I've done it incorrectly.

I don't understand the reason for the red cross next to the News commit. When I click on the cross, it says "R-CMD-check / windows-latest (devel) (pull_request)" was cancelled, but in the list of checks, it says that check was successful.

Thanks.

@ben-schwen

This comment was marked as resolved.

@markseeto

This comment was marked as resolved.

@MichaelChirico
Copy link
Member

@jangorecki any final review here? I can help merge to master & other tidying.

@MichaelChirico MichaelChirico added this to the 1.17.0 milestone Sep 6, 2024
Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't closely looked at the logic and test coverage but found few places which doesn't look perfect

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
R/groupingsets.R Outdated Show resolved Hide resolved
R/groupingsets.R Outdated Show resolved Hide resolved
R/groupingsets.R Outdated Show resolved Hide resolved
man/groupingsets.Rd Show resolved Hide resolved
Add 'label' argument to the groupingsets.data.table(), cube.data.table(), and rollup.data.table() functions.
Add to the Usage, Arguments, Details, and Examples sections.
@markseeto markseeto reopened this Sep 21, 2024
@markseeto
Copy link
Contributor Author

@jangorecki Thanks for reviewing. I've made the changes you asked for.

Regarding the code-quality / lint-r check, I fixed some of them but I hope it will be OK to leave long variable names because they're more descriptive that way.

@jangorecki
Copy link
Member

Please avoid force push when PR is already under reviews

@markseeto
Copy link
Contributor Author

Please avoid force push when PR is already under reviews

Thanks for approving and thanks for this feedback. I don't know what "force push" means but will try to find out so I can avoid doing them.

@MichaelChirico
Copy link
Member

I don't know what "force push" means

Here's what we're seeing:

image

Most of the time when this happens you get an error when trying git push about mismatch of remote&local, I believe the error tells you to retry with --force. If you saw something about "incompatible histories" I've also been running into the same of late, not sure why. Just try and be cognizant that --force should be avoided when possible, it makes the history harder to track.

@MichaelChirico
Copy link
Member

MichaelChirico commented Sep 23, 2024

One other note, I see this is your markseeto:master branch in your fork, in my experience it's smart to make fork edits in a dedicated branch, there are some subtle headaches to dealing with PRs in fork branches named master.

Anyway, you'll get an invite to be a data.table project member, so you'll be able to create branches directly on this repo going forward, so You're already a member: https://github.com/orgs/Rdatatable/people/markseeto, I guess this PR was created before the invite. The point is kind of moot for data.table specifically, but it's good advice for PRs to other packages.

NEWS.md Outdated Show resolved Hide resolved
(is.atomic(label) && length(label) == 1L) ||
(is.list(label) && all(vapply_1b(label, is.atomic)) && all(lengths(label) == 1L) && !is.null(names(label)))))
stopf("Argument 'label', if not NULL, must be a scalar or a named list of scalars.")
if (is.list(label) && !is.null(names(label)) && ("" %chin% names(label) || anyNA(names(label))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this !is.null(names(label)) check redundant? Since we have is.list(label) && ... !is.null(names(label)) in the above requirement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do find it a bit surprising that the above check requires "a named list of scalars" but we have a separate test for "all list elements must be named", maybe best to add in the check for ""/NA names to the above condition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can address this as a small follow-up PR if you agree, don't want to hold the PR back further.

Copy link
Contributor Author

@markseeto markseeto Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this !is.null(names(label)) check redundant?

Yes, I think you're right.

I do find it a bit surprising that the above check requires "a named list of scalars" but we have a separate test for "all list elements must be named", maybe best to add in the check for ""/NA names to the above condition?

Maybe, but with separate checks the error messages can be more specific, the second one being for the situation where label is a named list but not all elements have a name. If we combine the error messages into one, it would be something like "Argument 'label', if not NULL, must be (1) a scalar, or (2) a named list with each element being named and each element being a scalar." Or "Argument 'label', if not NULL, must be (1) a scalar, or (2) a named list with no names being "" or NA and each element being a scalar." I think these are less clear and less helpful than having separate error messages depending on the situation. If you disagree, please let me know. It's not something I feel strongly about.

R/groupingsets.R Outdated Show resolved Hide resolved
R/groupingsets.R Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

I fixed some of them but I hope it will be OK to leave long variable names because they're more descriptive that way.

Honestly I think it's overdoing it here. Those variables have very limited scope for an error branch. I just used simple names idx/info here, I don't think the code is noticeably less readable.

Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@MichaelChirico MichaelChirico merged commit 1494900 into Rdatatable:master Sep 23, 2024
7 of 8 checks passed
@markseeto
Copy link
Contributor Author

markseeto commented Sep 23, 2024

@MichaelChirico Thanks for checking and improving the code. I'll study your revised code and learn from it.

Here's what we're seeing:

image

Most of the time when this happens you get an error when trying git push about mismatch of remote&local, I believe the error tells you to retry with --force. If you saw something about "incompatible histories" I've also been running into the same of late, not sure why. Just try and be cognizant that --force should be avoided when possible, it makes the history harder to track.

Thanks for the explanation. I see that too. I used the GitHub GUI in the browser rather than command-line git (I haven't learned command-line git yet). I didn't explicitly select "force push", but just synced my fork and then made the changes, and GitHub automatically did the force push. Maybe I should have just changed the files without syncing the fork first.

One other note, I see this is your markseeto:master branch in your fork, in my experience it's smart to make fork edits in a dedicated branch, there are some subtle headaches to dealing with PRs in fork branches named master.

Thanks, I'll keep that in mind.

I guess this PR was created before the invite.

Yes, this PR was started before I was a member.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"total" label in cube/rollup/groupingsets
4 participants